CHAPTER 1:
INTRODUCTION 


1.1  Motivation 


In recent years, the blogging phenomenon has dramatically changed how Internet users access
and share information. Ongoing, periodic publication of news and opinions for an open
audience was once restricted to newspaper and magazine publishers. Even in the digital age,
publishing to the Internet was formerly restricted to large businesses and the most technically
savvy individuals. The free availability of many public, easy-to-use blog hosting services has
lowered that bar such that the only requirements for sharing your writing with the world are a
computer with Internet access and something to say.

The influence of individual blogs on news reporting has turned many of their authors into
celebrity writers who often drive, or even outshine, the traditional outlets. Niche blogs
allow writers and readers to seek each other out and connect for ongoing commentary on even
the most obscure topics. A single individual acts as writer, editor, and publisher, which removes
the revising and filtering process of traditional publishing and makes it easy to post raw,
opinionated, and controversial material if the author desires. Further, this individual may or
may not choose to reveal his or her true identity.

Myriad situations exist in which we may wish to discover the author of some anonymous
electronic communication, whether a blog post, a comment on a blog, content on a “wiki,” a
message board post, chat messages, or an anonymous email. The motivation for discovering the
author’s identity could range from forensic evidence gathering in criminal proceedings and
intelligence analysis to revealing or authenticating a “whistle blower,” or simple curiosity. In
the absence of other identifying information such as the originating computer’s IP address or
connection logs, the text itself may be our only means of discovering the true author of an
anonymous message. When the list of suspect or potential authors is extremely large, this
becomes a daunting task. Application of machine learning techniques, however, could allow the
list to be dramatically reduced, ideally to a single individual or to a set that is a fraction of the
size of the original, making the job of a human investigator much more manageable.
1.2 Organization of Thesis

In Chapter 1 we discuss the motivation for examining authorship attribution of electronic
documents, specifically blogs, due to the potential impact of such studies on real-world
investigations. Chapter 1 also introduces the concept of authorship discovery, in contrast to
traditional authorship attribution. In Chapter 2 we first outline the foundations of computational
attribution from the earliest studies to modern techniques. Chapter 2 also outlines the
characteristic language of blogs and how it compares to other forms of written text.

Chapter 3 presents an experiment in authorship attribution using a blog corpus. First, we discuss
the corpus preparation, including the motivation for using blogs as a testbed. Second, we detail
the classification scheme using a Bayesian classifier. Third, we introduce a corrective scaling
factor which, applied to the results of a classification, improves results dramatically. Finally,
we propose a metric for judging the success of classification by reducing the search space to
some threshold in scenarios where the search space is extremely large.

Chapter  4  discusses  the  results  of  our  experiment  in  Bayesian  classification  including  qualitative 
discussion  of  the  concept  of  relaxing  the  n-percent-correct  threshold  in  real-world  problems. 
Chapter  4  also  discusses  the  possibility  of  a  critical  flaw  in  our  approach,  which  arises  from  the 
inclusion  of  content  words,  and  must  be  regarded  with  caution. 

Finally,  Chapter  5  presents  a  brief  review  and  proposes  several  directions  in  which  the  study  of 
authorship  discovery  in  blogs  can  continue  to  move  forward. 
CHAPTER 2:
BACKGROUND  TOPICS 


In this chapter we discuss the foundations of computational authorship attribution. First, we
survey existing techniques for discovering authorship. Second, we explore the reasoning behind
examining online blogs and specifically their use as a corpus for authorship studies. Finally,
we explain classification using the Naive Bayes classifier, the primary algorithm used in our
research.

2.1  Authorship  Attribution 

2.1.1  Research  Scope 

The task of authorship attribution can be defined as a structured method of determining the
individual who generated some sample of text. Specifically, in this thesis we assume that the
task is performed strictly on electronic text files with no markup, such as timestamps,
originating computer identification, or textual formatting, to help distinguish between authors.
Tangential fields such as handwriting recognition or computer forensics are, therefore, not
discussed. To constrain the problem, we also do not consider the possibility that a human
expert could subjectively determine the author of a sample from its content or style, much as a
literary expert might; we instead restrict our study to computational methods.

The task of authorship attribution is also not strictly aligned with the task of authorship
verification. The task of validating with some level of confidence whether a single suspect
individual is the true author of a sample will also not be directly explored. In this thesis we
address the issue of authorship attribution or what may even be thought of as “authorship
discovery.”

2.1.2  History  of  Authorship  Attribution 

T.C. Mendenhall, in 1887, published what is considered the first scientific study of authorship
attribution based on syntactic characteristics of sample texts. In [26], his approach expands on
Augustus DeMorgan’s suggestion that comparing mean word length in two texts could be an
indicator of whether they were written by the same individual. Mendenhall argues that the mean
word length is, itself, not discriminating enough, but supposes that comparing a histogram of
word lengths, which he calls a “characteristic curve of composition,” would more finely resolve
the differences between authors. Relating the process to spectral analysis of the light emitted
when elements are heated, which is known to precisely identify the element, Mendenhall
suggests that an author may generate texts in the same uniquely characteristic manner as a
physical specimen would emit light.


Examples  of  Mendenhall’s  Curves  of  Composition  from  [26] 


Figure  2.1:  Histogram  demonstrating  “consistent” 
curve  between  samples  from  the  same  author, 
and  in  fact  the  same  text.  Visible  variance  is,  in 
Mendenhall’s  opinion,  due  to  the  relatively  small 
sample  size  of  1000  words. 


Figure  2.2:  Histogram  representing  curves  of  two 
different  authors.  Mendenhall,  though,  attributes 
their  virtual  similarity  to  “the  result  of  accident”  and 
claims  that  “it  would  not  be  likely  to  repeat  itself.” 


Mendenhall’s results were understandably limited. Generating curves required manual counting
of letters in sets of 1000-5000 words at a time from the works of classic authors. His initial
paper, as well as a follow-on in 1901, does suggest that, given a large enough sample,
characteristic curves emerge which allow discrimination between authors. However, the example
curves in Figures 2.1 and 2.2 do not make a convincing case for his conclusions. Mendenhall
recognizes the benefit of his approach, though, as “purely mechanical in its application,” which
was a new concept in the field. This is in contrast to the subjective analysis that a literary
scholar might perform to describe the differences between the eloquence of Dickens and
Thackeray, for example. Further, Mendenhall suggests that the approach could be equally
applied to counts of syllables or histograms of word counts per sentence.
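
As a rough illustration of the “characteristic curve of composition,” the following Python sketch
tabulates the relative frequency of each word length in a sample. The function names, the
regular-expression tokenizer, and the cap of 16 letters are assumptions made for this example.
Mendenhall counted by hand and compared curves by eye, so the distance function is only one
hypothetical way to compare two curves numerically.

```python
import re
from collections import Counter


def characteristic_curve(text, max_len=16):
    """Relative frequency of words of each length, 1..max_len letters.

    Words longer than max_len are counted in the last bin (an assumption
    of this sketch, not something taken from Mendenhall's papers).
    """
    words = re.findall(r"[A-Za-z]+", text)
    total = len(words)
    counts = Counter(min(len(w), max_len) for w in words)
    if total == 0:
        return [0.0] * max_len
    return [counts.get(n, 0) / total for n in range(1, max_len + 1)]


def curve_distance(curve_a, curve_b):
    """Sum of absolute differences between two curves (0 means identical)."""
    return sum(abs(a - b) for a, b in zip(curve_a, curve_b))
```

Two samples by the same author would be expected to give a small distance if Mendenhall’s
premise holds, although, as noted above, his own example curves do not make that case
convincingly.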

Building on Mendenhall’s premise that textual statistics can be used as an authorial fingerprint,
subsequent researchers have sought to use various additional measures, both in the same manner
as Mendenhall’s original experiments and with new methods of analysis.
• G. U. Yule, in 1939, counted lengths of an author’s sentences, concluding that
“sentence-length is a characteristic of an author’s style,” but that the judgement of authorship
must be “a personal one,” given the evidence of sentence length distributions [35]. In the two
specific cases Yule presents, he does make conclusive judgments about the authorship of
disputed texts, demonstrating, for example, that Thomas à Kempis’ mean sentence length
of 17.9 matched that of Imitatio Christi (mean of 16.2) more closely than that of Jean Charlier
de Gerson, once believed to be the author, at 23.4.

• Similarly, Conrad Mascol evaluated the New Testament Epistles using a measure of
sentences per printed page [25], determining that Paul had not written some of the books which
scholars believed he had.

• Wilhelm Fucks discriminated between authors using the average number of syllables per
word and the average distance between equal-syllabled words [8]. Fucks, too, concluded that a
study such as his reveals a “possibility of a quantitative classification which is very simple
to realize,” but recognized that his measures delineated samples largely by language,
level of prose, and progressive changes in style through historical periods rather than being
strictly indicative of authorship.

•  In  [7],  R.  Forsyth,  D.  Holmes  and  E.  Tse  revisit  syllable  length  measures  to  demonstrate  that 
the  Renaissance  scholar  Sigonio  likely  faked  his  supposedly  complete  version  of  Cicero’s 
Consolatio,  which  had  previously  existed  only  in  fragments,  concluding  that  portions  use 
language  more  characteristic  of  the  Renaissance  than  classical  times. 



Figure  2.3:  W.  Fucks’  Diagram  from  [8]  relating  frequencies  of  n-syllable 
words  to  the  distance  between  words  of  the  same  number  of  syllables.  Fit 
lines  indicate  German  versus  English  language  texts  and  position  on  line 
is  indicative  of  mixture  of  prose  and  verse  styles  of  writing. 
Further Stylometric Methods of Quantifying Style

Extending  beyond  word  and  sentence  length  histograms,  several  other  textual  measures  have 
been  proposed  and  used  for  authorship  attribution  problems.  In  [13],  Holmes  asserts: 

One of the fundamental notions in computational stylistics is the measurement of
what is termed the “richness” or “diversity” of an author’s vocabulary. The basic
assumption is that the writer has available a certain stock of words, some of which
he/she may favour more than others... If, furthermore, we can find a single
measure which is a function of all the vocabulary frequencies and which adequately
characterizes the sample frequency distribution we may then use that measure for
comparative purposes.

Among the most widely used measures in this category is the type-token ratio, the number of
unique word types, $V$, divided by the length of the text sample in tokens, $N$.¹ In plain
terms, this measure represents the breadth of the author’s vocabulary used in the sample of
interest. Unfortunately, the type-token ratio has limited use in authorship studies. In particular,
it is unstable with respect to the size of the document and may be highly dependent on other
factors such as the style of writing. The type-token ratio does, however, serve as an easily
understood starting point for understanding the quantification of an author’s “style.”

Additional  stylometric  measures  include: 

•  Word  Frequency  Distributions 

One implication of the well-known Zipf’s Law for text samples is that the vast majority
of word types in a text are used infrequently, with most of a text sample being comprised
of only a small set of types, describing a “frequency distribution of words in human
languages.” [24] Supposing that this distribution may vary slightly between individual
writers, it may be used to compare authors. In particular, counts of hapax legomena,
word types that are used only once, and hapax dislegomena, word types that are used only
twice, have been proposed as measures for authorship attribution but have been found to
be lacking on their own. [14]

¹ A word type encompasses all occurrences of that word in a text, whereas a word token is a
single occurrence of a word or other marker in the text; multiple occurrences of the same word
are each separate tokens but are all of the same type.
• Yule’s Characteristic (K)

Defined as $K = 10^4 \left(\sum_r r^2 V_r - N\right) / N^2$, where $V_r$ represents the number
of words that occur $r$ times in the sample and $N$ represents the total number of tokens in
the sample. Yule’s Characteristic is based on a Poisson distribution and describes the
proportion of words in a sample that are repeated $r$ times, weighted by $r^2$. [13]

•  Simpson’s  Index  (D) 

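$$D = \sum_{r} V_r \, \frac{r}{N} \, \frac{r - 1}{N - 1} = \sum_{i} \frac{n_i (n_i - 1)}{N (N - 1)}$$

for $r$ occurrences greater than zero or all $i$ types in $V$, where $n_i$ is the number of
tokens of type $i$.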

Simpson’s  Index  measures  the  probability  that  two  tokens,  drawn  randomly  from  a  sample 
of  text,  will  be  of  the  same  type.  In  particular,  it  is  useful  for  comparing  texts  of  different 
lengths.  [31]  [13] 

•  Entropy 

Borrowing from the thermodynamic concept of entropy, $S = -k \sum_i p_i \log p_i$, where
$p_i$ is the probability of appearance of the $i$th lemma and $k$ is an arbitrary constant,
represents the measure of disorder or randomness in a text sample. [13] [8] [4]

•  “S”  measure  introduced  by  Golcher  [10] 

$S(T, t) = \sum_{m,n} \log\left(F_T(s^t_{m,n}) + 1\right) / \mathcal{L}$, where
$F_T(s^t_{m,n})$ is the number of occurrences of $s^t_{m,n}$ in $T$. In practical terms, $S$
measures how frequently substrings of characters of all lengths, reminiscent of a power set,
are repeated in a text. Golcher’s published results perform comparably with other methods,
including “correct” classification of all the disputed Federalist papers.

• Gunning-Fog Index, Simple Measure of Gobbledygook, Automated Readability Index,
Flesch Reading Ease, Flesch-Kincaid Grade Level.

In [22], Mala borrows several novel linguistic measures, most of which represent the
complexity of a text as a level of linguistic sophistication by quantifying syllables per
sentence, for example, and uses a 3D visualization technique to produce on-screen “objects”
which a human subject can quickly and naturally determine to be similar or dissimilar.

Further, a multivariate approach, combining or comparing several different measures, will
almost certainly lend them even greater discriminating power. [14]
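
Several of the vocabulary richness measures listed above can be computed from the same two
tallies: the count of each word type, and the number of types that occur exactly r times. The
sketch below is a minimal illustration under stated assumptions: the input is an already
tokenized list of at least two tokens, entropy uses k = 1 and a base-2 logarithm, and the
function and key names are hypothetical rather than taken from any cited work.

```python
import math
from collections import Counter


def richness_measures(tokens):
    """Compute simple vocabulary richness measures for a list of tokens."""
    n = len(tokens)                      # N: total number of tokens
    if n < 2:
        raise ValueError("need at least two tokens")
    type_counts = Counter(tokens)        # n_i: occurrences of each type i
    v = len(type_counts)                 # V: number of distinct types
    # V_r: number of types that occur exactly r times
    spectrum = Counter(type_counts.values())

    sum_r2_vr = sum(r * r * v_r for r, v_r in spectrum.items())
    return {
        "type_token_ratio": v / n,
        "hapax_legomena": spectrum.get(1, 0),
        "hapax_dislegomena": spectrum.get(2, 0),
        # Yule's K = 10^4 (sum_r r^2 V_r - N) / N^2
        "yule_k": 1e4 * (sum_r2_vr - n) / (n * n),
        # Simpson's D = sum_r V_r r (r - 1) / (N (N - 1))
        "simpson_d": sum(v_r * r * (r - 1) for r, v_r in spectrum.items())
        / (n * (n - 1)),
        # Entropy with k = 1 and log base 2 (both choices are assumptions)
        "entropy_bits": -sum(
            (c / n) * math.log2(c / n) for c in type_counts.values()
        ),
    }
```

For example, `richness_measures("the cat sat on the mat the end".split())` reports a
type-token ratio of 0.75 and five hapax legomena. Collecting the values into one dictionary
also gives a natural feature vector for the multivariate approach suggested in [14].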
Lexical Approaches to Authorship Attribution

Whereas  the  above  stylometric  approaches  to  authorship  attribution  seek  to  generalize  a  text 
sample  based,  in  most  cases,  on  statistics  of  its  construction,  a  somewhat  different  approach  is 
to  examine  the  distribution  of  the  actual  words,  or  in  some  cases  letters  or  other  graphemes,  and 
their  comparative  usage  between  texts.  In  most  cases  these  lexical  techniques  do  not  approach 
the  level  of  semantic  analysis,  where  the  words  would  have  some  inherent  “meaning”  to  the 
classifier,  but  the  words  themselves  are  counted  and  manipulated  directly. 

In [5], Ellegard took an extremely labor-intensive approach to building word frequency
distributions for determining authorship of the Junius Letters. He manually constructed a
“distinctiveness” measure similar to tf-idf², in which words that appeared frequently (or
infrequently) in each of the suspect authors’ known works, but which did not appear frequently
in other writers’ documents, were highly ranked. Ellegard then manually counted these “plus”
and “minus” words in each of the Junius Letters for each author, arriving at a similarity score
for each author on each document. In the end, Ellegard’s conclusion was that Sir Philip Francis,
the suspected author of the letters, was the true writer. His approach was not without its faults,
however. In particular, Ellegard included content words in his lists of “plus” and “minus” words.
It is now common practice to regard a word with a high tf-idf score as distinctive of the primary
topic of some given document in a corpus. Because topical distinctiveness is essentially what
Ellegard was matching, his approach has the potential to align two distinct authors who write
about similar topics more closely than a single author who writes about disparate topics.
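
To make the topical bias concrete, the following sketch computes tf-idf scores for a small
collection of tokenized documents. It assumes one common formulation, term frequency times
the logarithm of (number of documents / document frequency). Ellegard’s hand-built
distinctiveness measure is not reproduced here, and the function name is hypothetical.

```python
import math
from collections import Counter


def tf_idf(documents):
    """Return one {term: score} dict per document.

    documents is a list of token lists. The score is
    (count in document / document length) * log(N / document frequency).
    """
    n_docs = len(documents)
    doc_freq = Counter()
    for doc in documents:
        doc_freq.update(set(doc))        # count each term once per document

    scores = []
    for doc in documents:
        term_counts = Counter(doc)
        scores.append({
            term: (count / len(doc)) * math.log(n_docs / doc_freq[term])
            for term, count in term_counts.items()
        })
    return scores
```

A term that appears in every document, such as a function word, scores zero, while a content
word concentrated in one document scores highly. This is why a ranking of this kind tends to
pick out topic rather than author.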

In their landmark 1963 and 1964 studies [28] [29], Mosteller and Wallace examine the
Federalist papers through statistical analysis of word frequencies. According to [28],

The Federalist papers were published anonymously in 1787-1788 by Alexander
Hamilton, John Jay, and James Madison to persuade the citizens of the State of
New York to ratify the Constitution. Of the 77 essays, 900 to 3500 words in length,
that appeared in newspapers, it is generally agreed that Jay wrote five: Nos. 2, 3, 4,
5, and 64, leaving no further problem about Jay’s share. Hamilton is identified as
the author of 43 papers, Madison of 14. The authorship of 12 papers (Nos. 49-58,
62, and 63) is in dispute between Hamilton and Madison; finally, there are also
three joint papers, Nos. 18, 19, and 20, where the issue is the extent of each man’s
contribution.

² Term Frequency - Inverse Document Frequency is a method of scaling the importance of a term
to a document based on how frequently it occurs in the document, scaled by how infrequently it
occurs in all documents in a corpus.


Early manual examination suggested that the use of certain words such as ‘upon,’ or a preference
for ‘while’ versus ‘whilst,’ were strong discriminators between Madison and Hamilton.
Extending this concept, Mosteller and Wallace constructed a set of 30 words, comprised of
function words such as ‘by,’ ‘of,’ and ‘to,’ as well as “well-liked” words such as ‘commonly,’
‘vigor,’ and ‘particularly,’ which were determined not to convey topical meaning and not to vary
with context. Examples of words not counted in the study were ‘war,’ ‘executive,’ and
‘legislature,’ despite the fact that they appeared very frequently, a standard often used for
determining function words. Counting the frequencies of these words for each author and fitting
them to a Poisson or negative binomial distribution (the difference is “not of major importance”
[28]) allows a model of prior probabilities to be built.


Turning to the disputed texts, Mosteller and Wallace used Bayes’ Theorem to balance the prior
probabilities of each individual’s potential authorship with the posterior odds that each text was
written by the individual given its word frequencies. In their example, if $x$ is one sample from
a discrete set of possible observations, $p_i$ is the prior probability of hypothesis $i$, and
$f_i(x)$, $i = 1, 2$, is the conditional probability of observing $x$ given that hypothesis $i$
is true, then

$$P(\text{Hypothesis 1} \mid x) = \frac{p_1 f_1(x)}{p_1 f_1(x) + p_2 f_2(x)}.$$

Mosteller  and  Wallace  make  judgments  in  the  paper  based  on  the  “odds”  of  one  hypothesis 
being  true  over  the  other,  with  hypothesis  1  being  that  Hamilton  was  the  author  of  the  paper 
in  question  and  hypothesis  2  that  Madison  was  the  author.  Final  odds  are  defined  as  the  initial 
odds  multiplied  by  the  likelihood  ratio,  or, 


$$\text{Odds}(1, 2 \mid x) = \frac{P(\text{Hypothesis 1} \mid x)}{P(\text{Hypothesis 2} \mid x)} = \frac{p_1 f_1(x)}{p_2 f_2(x)} = \left(\frac{p_1}{p_2}\right) \left(\frac{f_1(x)}{f_2(x)}\right).$$


Further, the likelihood ratio for multiple words is the product of the likelihood ratios for each
word individually and, to make the numbers manageable, the odds can also be computed in
logarithmic form as a sum of log-likelihood ratios. In [28], the problem of choosing initial odds
is explained away through the assumption that any appreciable number of observed words with
strong likelihood ratios will quickly overwhelm any variation in the initial odds. In a problem
such as the disputed Federalist papers, the initial odds for Hamilton versus Madison may as well
be 1, or a 50-50 chance that either individual was the author.
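
The odds computation can be sketched in a few lines of Python. This is an illustration only: it
assumes a Poisson model for each word, treats words as independent, and uses made-up usage
rates per 1000 words that are not the values estimated in [28]. All function and variable
names are hypothetical.

```python
import math


def poisson_log_pmf(count, rate_per_1000, length):
    """Log-probability of seeing `count` occurrences of a word in a text of
    `length` words, given a Poisson rate per 1000 words (rate must be > 0)."""
    lam = rate_per_1000 * length / 1000.0
    return count * math.log(lam) - lam - math.lgamma(count + 1)


def log_odds(counts, length, rates_1, rates_2, prior_odds=1.0):
    """Log of Odds(1, 2 | x): log prior odds plus the sum of per-word
    log-likelihood ratios log f_1(x) - log f_2(x)."""
    total = math.log(prior_odds)
    for word, count in counts.items():
        total += poisson_log_pmf(count, rates_1[word], length)
        total -= poisson_log_pmf(count, rates_2[word], length)
    return total


# Hypothetical rates per 1000 words, for illustration only.
rates_h1 = {"upon": 3.0, "whilst": 0.1}
rates_h2 = {"upon": 0.2, "whilst": 0.5}

# A 2000-word text that never uses 'upon' and uses 'whilst' once.
result = log_odds({"upon": 0, "whilst": 1}, 2000, rates_h1, rates_h2)
# result is negative, so the evidence favors hypothesis 2.
```

Doubling the prior odds adds only log 2 to the total, which illustrates why a handful of words
with strong likelihood ratios quickly overwhelms any reasonable choice of initial odds.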

The result of Mosteller and Wallace’s study confirms what historians believed about the
Federalist papers, that Madison had written all twelve of the disputed documents. Additionally,
they raise several issues relevant to the study of statistical authorship attribution in general,
such as the utility of function words as discriminating features and the observation that prior
distributions, which may have otherwise required human intervention through scholarly study
of a disputed text, are of negligible importance.

2.2  Lexical  Characteristics  of  Blogs 

It  is  quite  clear  to  anyone  who  reads  blogs  that  they  are  a  unique  form  of  written  communication. 
Looking  strictly  at  their  language  use,  the  subtle  differences  between  blogs  and  other  forms  of 
writing  begin  to  emerge.  In  [27],  Mishne  provides  a  thorough  overview  of  language  use  in 
blogs  and  the  difference  between  blogs  and  other  forms  of  text.  In  particular,  he  identifies 
top  indicative  words  from  distributions  for  web,  Usenet  and  blog  genres,  noting  that  “blogs 
have  a  distinctive  personal  feel,”  but  contain  “words  related  to  personal  surroundings  [. . .  ]  and 
references  to  current  events,”  supporting  the  intuition  that  their  language  model  is  a  combination 
of  personal  correspondence  and  news  reporting. 

Mishne also examines several measures of lexical difference between blogs and other corpora
such as the Kullback-Leibler divergence, perplexity, and three “readability” measures. KL
divergence expresses how different two probability distributions, $p$ and $q$, are and is
defined as their relative entropy, [24]

$$D(p \,\|\, q) = \sum_{x \in X} p(x) \log \frac{p(x)}{q(x)} \tag{2.1}$$

Using  a  measure  of  KL  divergence,  blogs  are  most  similar  to  “personal  letters”  (with  a  score  of 
0.25)  and  most  divergent  from  “scientific  articles”  (with  a  score  of  1.06).  Perhaps  surprisingly, 
blogs  are  significantly  different  from  “newspapers”  (with  a  score  of  0.48)  and  the  web  at  large 
(with  a  score  of  0.75). 
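
As a small worked example of Equation 2.1, the sketch below computes the divergence between
two word distributions stored as dictionaries. It assumes a base-2 logarithm and assumes that
q assigns nonzero probability to every word that p does; real corpora would need smoothing,
which is not shown. The function name is hypothetical.

```python
import math


def kl_divergence(p, q):
    """D(p || q) in bits for distributions given as {word: probability}.

    Terms with p(x) == 0 contribute nothing; q(x) must be > 0 wherever
    p(x) > 0, otherwise the divergence is undefined.
    """
    return sum(
        p_x * math.log2(p_x / q[x]) for x, p_x in p.items() if p_x > 0
    )


p = {"a": 0.5, "b": 0.5}
q = {"a": 0.25, "b": 0.75}
forward = kl_divergence(p, q)    # about 0.2075 bits
backward = kl_divergence(q, p)   # about 0.1887 bits
```

The two directions differ, so a reported score depends on which genre is taken as the
reference distribution.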

Turning to perplexity, defined for the probability distribution of a large sample of text from the
genre, $P$, as $2^{H(P)}$ where $H(P)$ is the entropy of the distribution [24], blogs have
relatively high scores, averaging 301. For comparison, newspapers have a reported score of 355,
essays are scored at 295, fiction is 245, and personal letters are 55 [27]. Mishne concludes:
The relatively high perplexity of blog language, compared with other genres to
which it is similar in the type of vocabulary, indicates less regularity in use of
language: sentences are more free-form and informal, and adhere less to strict rules.
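
The definition of perplexity as 2 raised to the entropy can be illustrated with a unigram
estimate. This is a simplifying assumption for the sketch: the scores reported in [27] were
presumably obtained with a more elaborate language model, and the function name is
hypothetical.

```python
import math
from collections import Counter


def unigram_perplexity(tokens):
    """Perplexity 2**H(P), with P estimated from unigram counts of tokens."""
    n = len(tokens)
    entropy = -sum(
        (count / n) * math.log2(count / n)
        for count in Counter(tokens).values()
    )
    return 2 ** entropy
```

A sample that uses four word types equally often has a perplexity of 4, while a sample that
repeats a single word has a perplexity of 1. Higher values therefore correspond to less
predictable language.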

Finally, in terms of readability, a measure tied to the familiar concept of “grade-levels,” Mishne
scores blogs relatively low, ranking them 9.9, 7.0 and 8.9 on the three scales examined
(Gunning-Fog³, Flesch-Kincaid⁴, and Simple Measure of Gobbledygook⁵, respectively). These
scores are higher than fiction, but lower than both “school” and “university” essays as well as
newspapers. Mishne attributes much of this to the actual age of most bloggers, which is in the
teens, and their consequently shorter sentence and word lengths.
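
The three readability scales reduce to simple arithmetic on counts of words, sentences,
syllables, and complex words. The sketch below follows the formulas given in the footnotes to
this section, including the simplified form of SMOG. Counting syllables and deciding which
words are “complex” is left to the caller, and the function names are hypothetical.

```python
import math


def gunning_fog(words, sentences, complex_words):
    """Gunning-Fog Index from raw counts."""
    return 0.4 * ((words / sentences) + 100 * (complex_words / words))


def flesch_kincaid_grade(words, sentences, syllables):
    """Flesch-Kincaid Grade Level from raw counts."""
    return 0.39 * (words / sentences) + 11.8 * (syllables / words) - 15.59


def smog(complex_words, sentences):
    """Simple Measure of Gobbledygook, simplified form, from raw counts."""
    return math.sqrt(complex_words * 30 / sentences) + 3
```

For a 100-word sample of 5 sentences with 10 complex words, `gunning_fog(100, 5, 10)`
evaluates to 12.0. Shorter sentences and shorter words lower all three scores, which matches
the explanation Mishne offers for the low scores of blogs.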

In general, Mishne concludes, in concert with other researchers who have studied the lexical
characteristics of blogs, that they are most similar to school essays. They clearly have similarity
on some levels with the language usage in news outlets and fictional writing, but must be
considered a separate genre with regard to the standard language model used in the blogosphere.
The conclusion that blogs do not, however, generally conform to a single, standard language
model is encouraging for authorship studies: an individual may not feel as compelled to
shoehorn their own style into the formalized rules mandated by other forms of writing.


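³ $\text{GFI} = 0.4 \left( \frac{\text{words}}{\text{sentence}} + 100 \, \frac{\text{complex words}}{\text{words}} \right)$ [22]

⁴ $\text{FGL} = 0.39 \left( \frac{\text{total words}}{\text{total sentences}} \right) + 11.8 \left( \frac{\text{total syllables}}{\text{total words}} \right) - 15.59$ [22]

⁵ $\text{SMOG} = \sqrt{\text{total complex words} \times \frac{30}{\text{total sentences}}} + 3$ [22]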
CHAPTER 3:
EXPERIMENT IN BAYESIAN CLASSIFICATION


3.1  Use  of  Blogs  as  Authorship  Attribution  Corpus 

3.1.1  The  Personally  Revealing  Nature  of  Blog  Writing 

The content of blogs is typically very personal to the writer. All will express some unique
viewpoint on the topic of interest, with some going so far as to write almost exclusively about
their personal lives. In recent years, these ‘diary’ type blogs have come to outnumber the
earlier ‘filter’ and ‘notebook’ blogs. Some authors even “define themselves through their blog.”
[27] [33] If we regard the goal of authorship studies as building an accurate model of an
author’s particular internal language, then the availability of a corpus with this level of access
into the mind of the author is a great asset to the study of authorship attribution.

In [13], Holmes suggests, for example, that sentence length measures are only applicable when
the author’s sentence division intent and use of punctuation are preserved. In traditional
authorship studies, where a document may have been edited prior to publishing, had its
punctuation usage standardized, or been translated between languages, this concern forces
researchers to approach these measures with reservation. Typical blogs, on the other hand, are
almost invariably the work of a single author and are not subject to the same level of editorial
scrutiny necessary for traditional print media.

3.1.2  The  Technical  Suitability  of  a  Blog  Corpus  for  Authorship 

Compared to the text subjects of traditional authorship attribution studies, blogs are relatively
easy to collect. Though the prevalence of resources such as Project Gutenberg¹ has made access
to classic literature much easier than in past eras, when researchers spent years manually
counting words off a printed page, blogs exist in a natively electronic format that is readily
accessible to anyone who wishes to access them.

In particular, they allow us to sample many times more authors than could be reasonably
examined through study of published literature or student essays generated for a particular
study, the traditional corpora for authorship studies.

¹ Project Gutenberg, available at http://www.gutenberg.org. Accessed 30 June 2008.
Blogs are inherently time sensitive, with the date and time being crucial to each post’s relevance.
Though we did not examine the chronological aspects of an author’s blog writing, this
information is available directly from the web host, making blogs an ideal corpus for researchers
examining the progression of an author’s writing style over time, for example.

3.2  Corpus  Preparation 

The  authorship  corpus  we  are  using  was  developed  by  J.  Schler,  M.  Koppel,  S.  Argamon  and  J. 
Pennebaker  [17]  and  contains  writings  from  nearly  20,000  authors  on  blogger.com.  The  corpus 
contains  a  single  XML  file  for  each  author  in  the  corpus,  with  each  file  containing  all  posts 
by  that  author  accessible  at  the  time  of  download  (August  2004)  annotated  with  the  date  and 
time  of  posting.  All  formatting  in  the  original  HTML  blog  has  been  removed,  leaving  only 
plain  text.  Additionally,  the  original  researchers  removed  all  URL  links,  replacing  them  with 
the  token  ‘urllink.’ 

In the interest of processing time, the larger corpus was limited to at most 2000 authors for each
experiment. To establish training and testing sets, at least 10% of each author’s text, measured
in words, was set aside for testing. Each blog’s posts were first shuffled to remove any
chronological influence. Next, a size threshold for each author’s testing set was chosen at 10% of
the size, in words, of all posts in the original file combined. Whole posts were then removed from
the original file and placed in a new testing file until the test sample’s size, in words, met or
exceeded the 10% threshold. The remaining posts were designated for training and written to a new file.
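The split procedure described above can be sketched in Python as follows. This is only an illustrative sketch, not the code used in the study: the function name `split_posts`, the whitespace-based word count, and the fixed random seed are all assumptions made for the example.

```python
import random


def split_posts(posts, test_fraction=0.10, seed=0):
    """Split one author's posts into (train, test) lists of whole posts.

    Hypothetical helper: word counts use whitespace splitting, which is an
    assumption; the original word-counting rule is not specified.
    """
    shuffled = list(posts)
    # Shuffle first so that the test set is not biased chronologically.
    random.Random(seed).shuffle(shuffled)

    total_words = sum(len(p.split()) for p in shuffled)
    threshold = test_fraction * total_words

    test, test_words = [], 0
    # Move whole posts into the test set until it meets or exceeds the threshold.
    while shuffled and test_words < threshold:
        post = shuffled.pop()
        test.append(post)
        test_words += len(post.split())
    return shuffled, test
```

Because whole posts are moved, the test set will usually be somewhat larger than 10% of the author's words, which matches the "met or exceeded" wording above.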

Training and test sets were both regarded as bag-of-words2 models. For this reason, we can
treat the concatenation of all posts in an author’s training set as a single document, and likewise
with the test document. For each classification experiment, the test document size was further
limited to 100, 250, 500, 750, or 1000 bigrams in order to test the improvement in accuracy as
the size of the document in question increases, a practical consideration for scenarios in which we
may wish to classify a very short text. The unit of classification, therefore, is a partial
document, composed of concatenated posts and truncated to the test length.


2A bag-of-words model is one where “all the structure and linear ordering of words within the context is
ignored” [24]. This assumption is naive in that the frequency of a word type’s occurrence certainly depends on its
context, but evidence suggests that results are not severely impacted in many scenarios.
3.3 Model Building for Each Author


3.3.1  Construction  of  Models  for  Training  Data 

To  cope  with  the  large  size  of  the  set  of  suspect  authors,  a  model  for  each  was  constructed  from 
the  training  data  and  saved  prior  to  classification. 

Sentence  Chunking 

Each post was divided into sentences so that we could retain nominal position information,
particularly which words were most likely to occur as the first or last word of a sentence.
Boundaries were detected by the Punkt tokenizer, described in [15]. The Punkt algorithm is
based on simple division rules for punctuation, but it is initially “trained” on large samples of
text so that it can learn the nuances of when punctuation does or does not actually indicate the
end of a sentence. For example, Punkt will learn that ‘Dr.’ occurs frequently as an abbreviation
whose period does not necessarily mark the end of a sentence. In the presence of other
information indicating that the author intended to conclude a sentence with the abbreviation
‘Dr.,’ however, the system will divide the sentence there. Though we did not train Punkt on
annotated blog data, instead using standard English training data, the algorithm performed very
well across the blog corpus. The lack of strict formalities in blog writing, however, makes the
notion of dividing text into traditional sentences inherently difficult. Consistency between
training and testing data should mitigate this.

Bigram  Tokenization 

For classification we focused on bigram word frequencies. Word n-grams are groupings of n
words appearing next to each other in the text. Unigrams are, therefore, n-grams with n=1, or
single word tokens, and bigrams are n-grams with n=2. Use of bigrams allows us to retain
some notion of sequential information without the sparsity problem that arises when larger
groupings are used. Bigrams are determined with a simple sliding window such that each bigram is the
space-separated string “$w_i\ w_{i+1}$” for $i \in \{1, 2, \ldots, n-1\}$, where $w_i$ is the word at
position $i$ in the text sample and $n$ is the length, in words, of the text. Additionally, a new
token, ‘<S>’, is inserted to retain start-of-sentence and end-of-sentence position information. As a result,

•  The  first  bigram  in  a  sentence  is  ‘<S>  firstword’ 

•  The  last  bigram  in  a  sentence  is  ‘lastword  <S>’ 
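As a minimal sketch of the sliding-window tokenization just described, the following Python function pads each sentence with the boundary token and emits space-separated bigram strings. It assumes the sentences have already been separated (for example by Punkt) and that words are split on whitespace; the function name `sentence_bigrams` is hypothetical.

```python
def sentence_bigrams(sentences, boundary="<S>"):
    """Return space-separated word bigrams with sentence-boundary tokens.

    Assumes each element of `sentences` is one sentence and that words are
    separated by whitespace (an assumption made for this illustration).
    """
    bigrams = []
    for sentence in sentences:
        words = sentence.split()
        if not words:
            continue  # skip empty sentences
        # Pad both ends so the first and last words keep position information.
        padded = [boundary] + words + [boundary]
        # Slide a window of width two across the padded sentence.
        bigrams.extend(f"{a} {b}" for a, b in zip(padded, padded[1:]))
    return bigrams
```

For the single sentence "the dog barks" this yields the four bigrams `<S> the`, `the dog`, `dog barks`, and `barks <S>`. Truncating a test document to a fixed length is then a slice such as `bigrams[:500]`.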
Frequency Distributions of Bigrams

Each model consists of a frequency distribution, using the NLTK Frequency Distribution object
[21], keyed on bigrams as space-separated strings. NLTK, the Natural Language Toolkit3,
is a module for the Python4 programming language containing frequently used functions for
computational linguistics and natural language processing tasks. Applied to bigram token samples,
the FreqDist object records samples and provides the frequency of any particular type as a
fraction of all observed samples.

Therefore,  for  each  author  a  and  bigram  b, 

$$P(b \mid a) = \frac{f_b}{|D_{T,a}|} \tag{3.1}$$

where $f_b$ is the count of occurrences of b in a’s training sample,
and $|D_{T,a}|$ is the count of bigrams in a’s training sample, T.

Additionally,  simple  Witten-Bell  smoothing5  was  applied  to  each  author’s  bigram  distribution 
in  order  to  deal  with  unseen  bigrams. 
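The following standard-library sketch illustrates a relative-frequency model with a Witten-Bell style estimate for unseen bigrams. It is not the NLTK implementation used in the study. The class name `BigramModel` and the `vocabulary_size` parameter are assumptions, and T is taken here as the number of distinct bigram types observed for the author, which may differ from the corpus-wide estimate the study describes in its footnote.

```python
from collections import Counter


class BigramModel:
    """Per-author bigram frequencies with a Witten-Bell style estimate.

    Hypothetical sketch: `vocabulary_size` is an assumed upper bound on the
    number of possible bigram types and must exceed the number observed.
    """

    def __init__(self, bigrams, vocabulary_size):
        self.counts = Counter(bigrams)
        self.n = sum(self.counts.values())  # N: bigram tokens observed
        self.t = len(self.counts)           # T: distinct bigram types observed
        self.z = vocabulary_size - self.t   # Z: number of unseen types
        if self.n == 0 or self.z <= 0:
            raise ValueError("need a non-empty sample and vocabulary_size > T")

    def relative_frequency(self, bigram):
        """Unsmoothed estimate: count of the bigram over all bigram tokens."""
        return self.counts[bigram] / self.n

    def prob(self, bigram):
        """Smoothed estimate: c/(N+T) if seen, T/(Z(N+T)) if unseen."""
        c = self.counts[bigram]
        if c:
            return c / (self.n + self.t)
        return self.t / (self.z * (self.n + self.t))
```

Under these assumptions the seen types receive total mass N/(N+T) and the Z unseen types share the remaining T/(N+T), so the smoothed distribution sums to one.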


3.3.2  Prior  probabilities  model 


The prior probability of an author, also known as the “initial odds” [28], is the probability that
they wrote the document in question without regard to the contents of the document.
In our study, an author’s prior probability was based on the number of bigrams in their training
sample as a fraction of the number of bigrams in all authors’ training samples, representing how
prolifically an author writes compared to his or her peers. The prior probability of author $a_i$,
then, is


$$P(a_i) = \frac{|D_{T,a_i}|}{\sum_{a_j \in A} |D_{T,a_j}|} \tag{3.2}$$
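As a small illustration of this prior, the sketch below converts per-author training sizes (in bigrams) into prior probabilities. The function name and the dictionary layout are assumptions chosen for the example.

```python
def prior_probabilities(training_sizes):
    """Map each author to their share of all training bigrams.

    `training_sizes` is a hypothetical dict of author id -> bigram count.
    """
    total = sum(training_sizes.values())
    return {author: size / total for author, size in training_sizes.items()}
```

For instance, an author with 300 training bigrams in a two-author corpus of 400 bigrams receives a prior of 0.75.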


3NLTK  available  at  http://www.nltk.org.  Accessed  30  June,  2008. 

4Python  available  at  http://www.python.org.  Accessed  30  June,  2008. 

5Witten-Bell smoothing models the “probability of a previously unseen event by estimating the probability of
seeing such a new event at each point as one proceeds through the training corpus.” Cf. [24], page 222. In our case,
the probability of an unseen sample was approximated by T / (Z(N + T)), where T is the number of types, N is the
number of samples observed, and Z is a scaling factor to ensure that the mass of the new distribution is 1. Further,
T is approximated from the count of all bigram types in the entire corpus to estimate the maximum possible
vocabulary size. The exact number chosen for this parameter was of little importance: even drastic experimental
manipulation produced no change in results.
3.4 Naive Bayes Classifier

3.4.1  Bayes’  Theorem 


Bayes’  rule  is  widely  used  for  deriving  the  conditional  probability  of  some  event,  X,  given  Y, 
based  on  the  marginal  probabilities  of  X  and  Y  and  the  probability  of  Y  conditional  on  X,  all 
of  which  may  be  easier  to  determine.  Generally  stated, 


$$P(X \mid Y) = \frac{P(Y \mid X)\, P(X)}{P(Y)} \tag{3.3}$$


Applied to determination of authorship for a suspect a when a test feature vector, $F_t$, is
observed,

$$P(a \mid F_t) = \frac{P(F_t \mid a)\, P(a)}{P(F_t)} \tag{3.4}$$


If a set of potential authors, A, is known, the most probable among them is the one with the
highest probability, or

$$a^* = \operatorname*{argmax}_{a_i \in A} \frac{P(F_t \mid a_i)\, P(a_i)}{P(F_t)} \tag{3.5}$$


Because  the  term  P(Ft)  does  not  change  between  authors,  the  argmax  operator  allows  us  to 
discard  it, 


$$a^* = \operatorname*{argmax}_{a_i \in A} P(F_t \mid a_i)\, P(a_i) \tag{3.6}$$


Making  the  “naive”  assumption  that  each  element  of  the  feature  vector  Ft  is  independent  of 
every  other  element,  we  can  arrive  at  P(Ft  |  di)  by  taking  the  product  of  each  element, 


$$a^* = \operatorname*{argmax}_{a_i \in A} P(a_i) \prod_{f_j \in F_t} P(f_j \mid a_i) \tag{3.7}$$


Finally,  because  the  product  of  small  probabilities  quickly  becomes  unmanageably  small,  we 
instead  take  the  sum  of  the  log-probabilities 


$$a^* = \operatorname*{argmax}_{a_i \in A} \left[ \log P(a_i) + \sum_{f_j \in F_t} \log P(f_j \mid a_i) \right] \tag{3.8}$$
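The motivation for moving to log-probabilities can be seen in a short numerical illustration. The values below are invented for the example: one hundred features, each with probability 1e-5.

```python
import math

# Invented example: 100 bigram probabilities of 1e-5 each.
probabilities = [1e-5] * 100

# The true product is 1e-500, far below the smallest positive double,
# so the floating-point product underflows to exactly 0.0.
product = math.prod(probabilities)

# The sum of log-probabilities stays well within floating-point range.
log_sum = sum(math.log(p) for p in probabilities)
```

Once the product underflows to zero, every author scores identically and no ranking is possible, whereas the log-sum (about -1151.3 here) remains comparable between authors.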
3.4.2 Extension of classifier for ranking


For  our  evaluation,  it  is  more  advantageous  to  assign  each  potential  author  a  score6  by  which 
they  may  be  compared  and  ranked  rather  than  simply  returning  the  single  most  probable  author. 

$$S(a_i \mid F_t) = \log P(a_i) + \sum_{f_j \in F_t} \log P(f_j \mid a_i). \tag{3.9}$$


The  single  most  probable  author  can  still  be  chosen,  of  course,  by 

$$a^* = \operatorname*{argmax}_{a_i \in A} S(a_i \mid F_t). \tag{3.10}$$
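The scoring and ranking scheme can be sketched as follows. This is an illustrative outline rather than the study's implementation: `models` is assumed to map each author to a callable returning a smoothed, strictly positive probability for a bigram, and `priors` to map each author to a prior probability. Both names are hypothetical.

```python
import math


def score(test_bigrams, log_prior, bigram_prob):
    """S(a | F_t): log prior plus the summed log-probabilities of the test bigrams."""
    return log_prior + sum(math.log(bigram_prob(b)) for b in test_bigrams)


def rank_authors(test_bigrams, priors, models):
    """Return (author, score) pairs sorted from most to least probable.

    Assumes every model returns a strictly positive probability, e.g. after
    smoothing, so that the logarithm is always defined.
    """
    scores = {
        author: score(test_bigrams, math.log(priors[author]), models[author])
        for author in priors
    }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)
```

The first element of the returned list corresponds to the single most probable author, while the full list supports the rank-based evaluation used in the rest of the chapter.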


3.5  Corrective  transformation  of  results 

Observation of all authors’ scores from each test sample revealed that a limited number of authors
with the highest prior probabilities were overwhelmingly returned as the “correct” author
by the classifier. Figure 3.1, a confusion matrix of author IDs, illustrates this discrepancy.
Note that the higher author IDs belong to authors with higher prior probabilities.

Points  on  the  diagonal  are  authors  who  were  correctly  classified,  defined  as  having  the  highest 
score on the test document sampled from his or her blog. It is apparent from figure 3.1 that
regardless  of  the  prior  probability  of  the  true  author  for  any  arbitrary  test  sample,  the  Bayesian 
classifier  returned  one  of  the  few  authors  with  the  highest  prior  probabilities,  that  is,  authors 
who  wrote  quite  prolifically. 

Examination of the results from a single test in figure 3.2, where the true author of the document
in  question  was  ranked  the  49th  most  probable  of  2000  authors,  demonstrates  that  the  prior 
probability  of  a  suspect  author  is  closely  correlated  with  their  ranking  in  this  test.  It  is  quite 
clear  from  examination  of  the  plot,  however,  that  the  true  author  has  a  lower  prior  probability 
than  those  who  are  similarly  ranked  on  their  scores.  That  is,  the  true  author  has  a  much  lower 
prior  probability  relative  to  his  or  her  score  than  do  the  authors  ranked  48th,  47th,  50th,  51st,  etc. 
This  trend  was  observed  in  many  tests  when  manually  examined. 


6This  “score”  does  not  represent  an  absolute  “probability”  that  the  given  author  wrote  a  text  sample.  Instead 
it  is  a  strictly  comparative  measure  within  a  single  test.  In  particular,  no  evidence  was  found  to  suggest  that  this 
score  represents  a  level  of  confidence  in  the  classification  or  other  such  metric  that  could  be  compared  between  test 
samples. 
Figure 3.1: Confusion Matrix for All Test Documents of 500 Bigrams Each [plot title: Confusion Matrix for 2000 Test Documents of 500 Bigrams Each; axis label: Classified-As Author ID]

Figure 3.2: Full results of a single test, before transformation, ordered by score [axis labels: Classifier Score, Prior Probability]


Removing  the  prior  probability  term  in  the  Bayesian  classifier  had  little  effect  in  correcting  this 
influence.  Instead,  we  normalized  the  scores  and  negative-log-priors  such  that  the  maximum 
(best)  score  became  zero  and  the  lowest  score  became  -1,  and  the  most  prolific  author  had  a 
log-prior  of  1  and  the  least  prolific  had  a  log-prior  of  zero. 


Figure 3.3: Normalized results of a single test, before transformation [x-axis: Negative Log Prior (Scaled 0-1); legend: True Author Score (PreTrans), Untransformed Score, linear fit of the untransformed scores; fitted line: y = 0.913x - 0.9525]


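Plotting the output from a single test in Figure 3.3 allows a least squares linear regression to be
performed and a scaling factor β, unique to the current test, to be obtained:

$$\beta = \frac{\sum_k \left(\log P(a_k) - \overline{\log P(a)}\right)\left(S(a_k) - \overline{S(a)}\right)}{\sum_k \left(\log P(a_k) - \overline{\log P(a)}\right)^2} \tag{3.11}$$

where the overlined terms denote means over all suspect authors in the test.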

Using the slope of the regression line, β, as a corrective factor allows a modified score to be
calculated for each data point in the test results, shown in figure 3.4. The results can then be
re-sorted on this new score and the most probable author determined by the maximum S'.

$$S'(a_i \mid F_t) = S(a_i \mid F_t) - \beta \log P(a_i) \tag{3.12}$$

$$S'(a_i \mid F_t) = (1 - \beta) \log P(a_i) + \sum_{f_j \in F_t} \log P(f_j \mid a_i) \tag{3.13}$$

S' for each author is a corrected score where we are essentially discounting more or less of the
influence of that author’s prior probability based on the scaling factor β as determined by the
slope of the regression line through all authors’ scores.
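The corrective transformation can be sketched with the standard library alone. The sketch fits the least squares slope of score against log-prior for one test and subtracts the fitted prior contribution from each score. It operates on raw, unnormalized values, which is an assumption: the plotted regression in this chapter uses rescaled axes, so the numerical slope there would differ. The function names are hypothetical.

```python
def corrective_beta(log_priors, scores):
    """Least squares slope of score on log-prior for a single test."""
    n = len(scores)
    mean_x = sum(log_priors) / n
    mean_y = sum(scores) / n
    numerator = sum(
        (x - mean_x) * (y - mean_y) for x, y in zip(log_priors, scores)
    )
    denominator = sum((x - mean_x) ** 2 for x in log_priors)
    return numerator / denominator


def corrected_scores(log_priors, scores):
    """Apply S' = S - beta * log P(a) to every suspect author in the test."""
    beta = corrective_beta(log_priors, scores)
    return [s - beta * x for x, s in zip(log_priors, scores)]
```

An author whose score sits above the regression line, that is, who scores better than their prior alone would predict, keeps that advantage after the correction, while the general trend with the prior is removed.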
Figure 3.4: Normalized results of a single test, after corrective transformation. Compare to Figure 3.3

Figure 3.5: Confusion Matrix for All Test Documents of 500 Bigrams Each after corrective transformation. Compare to Figure 3.1

In figure 3.4, a plot of S' for the same example test as figure 3.3, the true author is assigned
the  highest  S'  and  is,  visually,  clearly  distinguishable.  Examination  of  many  test  documents 
suggests  that  this  pattern  occurs  with  great  regularity. 

Comparing the confusion matrix from before corrective transformation, figure 3.1, to the confusion
matrix of all tests after corrective transformation has been performed, figure 3.5, it is clear
that the transformation not only dramatically improves classifier accuracy, with many more
points on the diagonal, but also removes the strong bias toward classifying all samples as one
of the few authors with the highest priors.


Table 3.1: Count of authors classified exactly correctly among 2000 suspects, for 2000 test documents [columns: Pre-Transformation count (of 2000), Post-Transformation count (of 2000); legible cell fragments: 1000, 217, 74%]


3.6 N-Percent-Correct Threshold

In many cases it is desirable for an author with a score in some top threshold of all scores to
be regarded as a “match” rather than only the author with the single maximum score. This threshold
could be used to reduce the search space of many thousands of potential authors to a few
likely candidates with a high degree of certainty, making the job of a human investigator or
more sophisticated classification much more manageable. Regarding the problem as a task of
authorship discovery rather than authorship verification, the utility of this metric is apparent.


Table 3.2: Relaxing an “exactly correct” classification to tests where the true author was ranked first or second [columns: Pre-Transformation, Post-Transformation]

Figure 3.6: Classifier accuracy at progressively relaxing n-percent-correct threshold

Figure 3.7: Detailed view of classifier accuracy at progressively relaxing n-percent-correct threshold [x-axis: Threshold]
Table 3.2 indicates the number of correct classifications made among 2000 test samples if we
define  a  correct  classification  as  the  true  author  being  within  the  top  n  =  0.1%  when  sorted  on 
S',  that  is,  ranked  first  or  second.  Supposing  we  are  willing  to  accept  a  new  set  of  suspects  that 
is  a  fraction  the  size  of  the  original  set7,  accuracy  can  be  improved  significantly. 

The curves in figures 3.6 and 3.7 represent classifier accuracy as the n-percent threshold is
progressively relaxed and as more or less testing data is available. For example, suppose we
wish to identify the author of a sample of 500 bigrams from among 2000 possible authors. If
we are willing to accept a new set of 100 possible authors, a reduction of the search space to just
5.0% of its original size, the probability that the true author is among the new subset is 91.0%.
If 1000 bigrams were available, the probability of a “correct” classification rises to 95.4%. Even
with test samples of only 100 bigrams, after transformation we can reduce the search space by half
with greater than 95% assurance that the true author is in the new subset of potential authors.
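The n-percent-correct measure can be computed from the rank of the true author in each test, as in the sketch below. The function name and the rounding of the cutoff to the nearest whole number of authors (with a minimum of one) are assumptions made for the illustration.

```python
def n_percent_accuracy(true_author_ranks, num_suspects, n_percent):
    """Fraction of tests whose true author falls within the top n percent.

    `true_author_ranks` holds the 1-based rank of the true author in each
    test. The cutoff is rounded to a whole number of authors (assumption).
    """
    cutoff = max(1, round(num_suspects * n_percent / 100))
    hits = sum(1 for rank in true_author_ranks if rank <= cutoff)
    return hits / len(true_author_ranks)
```

With 2000 suspects, a threshold of 0.1% corresponds to a cutoff of 2 authors (ranked first or second), and a threshold of 5% corresponds to a cutoff of 100 authors.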
Table 3.3: Classifier accuracy for progressively relaxing n-percent threshold, where classification within the top
n-percent  of  all  scores  is  considered  “correct” 

Figures 3.6 and 3.7 also include plots of classifier accuracy before transformation. For test sample
sizes of 250 bigrams or larger the search space could be reduced by half with 90% or greater
accuracy, but attempting to reduce the size further resulted in severely degraded accuracy. Only
with the corrective scaling were we able to both limit the search space to a reasonably small
size and to do so with a high degree of accuracy.


7We  also  discount  the  possibility  that  the  true  author  may  not  be  represented  in  the  original  search  space  at  all. 
CHAPTER 4:
Discussion  of  Results 


4.1  Effect  of  test  sample  size 

One area in which this thesis differs from similar studies in authorship attribution is our
examination of the influence of limited test sample sizes on classification accuracy. It is easily
conceivable that practical authorship attribution problems would require methods that are
accurate even on very short samples of text, such as might be found in a very short blog post, a
comment on a blog, a short email, or a sentence or two appearing on a wiki. Past research has
explored the possibility of authorship attribution where the text is short by nature, such
as in poetry [32], but we are aware of none that addresses the possible degradation of accuracy
in cases where only a few sentences are available.

The  spacing  of  the  curves  in  figure  3.6  reveals  insight  into  the  classifier’s  performance  on 
smaller  test  sample  sizes.  As  we  increased  the  available  test  sample  size  from  100  to  250  to 
500,  and  so  on,  the  accuracy  improved  logarithmically,  with  diminishing  returns  from  increased 
data. 

This supports the obvious intuition that the more test data we can gather, the better our
classifier’s performance will be. The time penalty is not so significant that we would ever want to
limit the size artificially for actual problems of determining authorship. It does also suggest,
though, that this type of classifier is a good choice for situations where the available test data is
very limited. A test sample size of 500 bigrams was often used as the baseline for comparison
in this study, and represents a level where the search space may be reduced most dramatically
with a high level of confidence, for example reducing the search space to 5.0% of its original
size with an accuracy over 90%, or classifying over half of the test samples exactly correctly. 500
bigrams is, however, more text than may be available in many problems. For reference, this
paragraph is less than 200 bigrams in length and is typical of the size of a single blog post in
our corpus.
4.2 Inclusion of content words

We  must  point  out  that  the  inclusion  of  content  words  as  features  for  the  classifier  may  be  a 
critical  flaw  in  the  application  of  this  approach  to  real-world  problems  of  authorship  discovery. 
We  suspect  the  results  would  not  have  been  so  positive  had  we  restricted  our  study  to  function 
words  or  otherwise  abstracted  away  the  effect  of  topic  and  context.  In  particular,  it  is  impractical 
or  impossible  to  construct  a  large  scale  scenario  where  the  classifier  is  trained  on  samples  of 
one  subject  and  tested  on  samples  of  text  on  a  significantly  different  topic,  but  we  suspect  that 
attribution  would  be  given  to  the  authors  of  training  samples  which  align  more  closely  to  the 
test  on  topic  than  on  authorship. 

For example, in an intelligence situation, suppose we wish to discover the author of an anonymous
text sample that discusses detailed plans to use homemade chemical weapons. The true
author of the message maintains a blog where he discusses his daily life but does not address his
clandestine activities and therefore uses few of the same context-specific words in his blog as
were used in the test sample. Perhaps other indicative terms or idiosyncratic spelling would
improve the true author’s rank slightly, but his score would be quickly swamped by other bloggers
such as legitimate chemical engineers who use the same context-specific words in their training
samples.

Of course, situations also exist in which matching the topic and context is advantageous, such
as in a plagiarism investigation. Suppose a sample of text from a paper is believed to have come
from  a  blog  source  but  does  not  give  credit  and  does  not  reuse  exact  text  strings  from  the  original 
source,  making  it  difficult  to  find  the  original  source.  Bayesian  classification  using  all  words  as 
features  would  be  more  likely  to  reveal  the  source,  causing  it  to  emerge  from  the  many  blogs 
on  other  topics  and  hopefully  returning  it  in  a  small  subset  of  possible  blogs  which  match  most 
closely. 

4.3  Corrective  Scaling 

Unfortunately the full explanation of why performing regression on the results of a classifier
has such a dramatic effect is not known. In particular, we have not determined whether we can
arrive at the same scaling factor, β, through some other means or otherwise replicate its effect.
For example, a more sophisticated back-off scheme for determining feature probabilities would
reduce the occurrences of unseen bigrams, which we suspect would improve results without
post-classification transformation and would be likely to make post-classification transformation
less beneficial or entirely unnecessary. Other possibilities are that latent semantic analysis
or some other transformation of the space before classification could have a similar effect.

The  shortcoming  of  this  method  is  that  it  requires  accurate  estimates  of  prior  probabilities  for  all 
authors,  which  may  not  always  be  known.  In  our  study,  the  prior  probabilities  were  determined 
by  the  fraction  of  the  entire  training  corpus  that  was  attributed  to  each  author  in  question.  If  this 
does  not  accurately  represent  the  true  prior  probabilities  of  all  authors,  scaling  the  flawed  priors 
by the factor β would not be likely to produce meaningful results.
CHAPTER 5:

Summary  and  Future  Work 


5.1  Summary  of  Experiment  in  Bayesian  Classification 

In  this  experiment  we  constructed  bigram  word  frequency  models  for  2000  authors  at  a  time  and 
used  them  to  then  classify  an  unseen  test  sample  from  each  author.  The  classifier  used  was  based 
on  Bayes’  rule,  but  used  a  scoring  scheme  to  return  authors  in  ranked  order  from  most  probable 
to  least.  Further,  least  squares  regression  on  the  scores  from  each  classification  test  allowed 
us to compute a scaling factor, β, which was used to discount each potential author’s score.
Reordering  the  results  by  the  modified  score  produced  dramatic  improvements  demonstrated  in 
table  3.1. 

We  also  introduce  the  concept  of  an  ‘n-percent-correct  threshold.’  When  the  list  of  suspect 
authors  is  extremely  large,  it  is  not  only  extraordinarily  difficult  to  reliably  classify  a  document 
in  question  to  the  single  true  author,  but  it  may  not  always  be  necessary.  Many  cases  exist  where 
returning  a  subset  of  the  original  search  space  that  is  some  n%  the  size  of  the  original  can  be 
a  very  useful  result.  In  figures  3.6  and  3.7  we  demonstrate  that  as  the  threshold  is  relaxed,  say 
from  1%  to  5%,  the  cumulative  percentage  of  test  documents  classified  “correctly”  within  the 
threshold increases significantly. For large sample sizes, we are able to achieve 95% accuracy
by defining a “correct” result as being classified within the top 5%, a reduction from 2000
possible authors to just 100 in our experiments.

As  expected,  when  the  test  samples  were  allowed  to  be  larger,  classifier  accuracy  improved. 
However,  the  classifier  performed  well  (90%  accuracy  when  reducing  the  search  space  to  1/4  its 
original  size,  for  example)  even  on  samples  as  small  as  100  bigrams.  Additionally,  there  were 
diminishing  returns  with  test  sample  sizes  above  500  bigrams,  suggesting  that  large  samples  are 
not  required  to  use  this  technique  effectively. 

5.2  Practical  Application  of  Techniques 

Despite the possibility that inclusion of content words introduces a significant flaw into this
technique, we believe it has a high level of utility for practical problems. Though 2000 suspects
is a significantly larger set than in any other known study of authorship attribution, discovering an
author among the approximately 113 million or more active bloggers1 is a thoroughly daunting
task. Even if this technique cannot provide investigators with the single true author on the first
pass, it certainly can give them a starting point for further investigation.

The  scalability  of  this  technique  would  need  to  be  further  examined  and  optimized  before  it 
could  be  used  in  real  world  situations.  Building  bigram  frequency  models  for  10,000  authors 
took  several  hours  of  processing  time,  and  classification  of  a  document  among  2000  suspects 
required 1-2 minutes per document in question. When the search space was enlarged beyond what
could fit into memory, disk access delay caused classification times to degrade to several minutes
per document in question. These are acceptable times in research, but building and storing
models  is  not  likely  to  be  practical  in  a  deployed  system. 

5.3  Future  Directions 

The potential for further study in this area is exciting. Unfortunately the influence
of content words on this particular technique has dissuaded us from pursuing it further as it
exists, but it could spawn additional studies to determine the best method to abstract away the
influence of topic and context.

5.3.1  Abstracting  away  topic  and  context  influence 

We  have  already  begun  investigating  the  possibility  of  tagging  blog  data  with  part-of-speech 
tags  and  building  frequency  models  for  POS  n-grams.  This  abstraction  would  not  only  remove 
the  actual  content  from  language  (e.g.,  abstracting  the  use  of  both  ‘dog’  and  ‘computer’  to 
the  same  token,  ‘noun’)  but  would  reduce  the  feature  vector  sizes  by  orders  of  magnitude  and 
decrease  the  sparsity  of  an  individual’s  model. 

As a consequence of compacting the feature vector sizes, we have been able to examine the
use of Markov chains to model a particular individual’s language use with encouraging results.
Comparing probabilities of suspect authors using first-order chains, however, has not performed
as well as simple chi-squared comparison of frequency distributions for POS n-grams.

Early  results  also  suggest  that  these  techniques  are  better  suited  for  smaller  scale  problems  such 
as  determining  authorship  among  sets  of  suspects  no  larger  than  100-200. 


1Source: Blogs currently tracked by Technorati.com as of June 2008.
5.3.2 Further examination of authorship discovery

We believe additional techniques for authorship discovery should be explored in order to
effectively and efficiently address the many real-world situations in which they could be useful. This
could include application of a combination of

•  Forensic  techniques  such  as  IP  address  geolocation  from  data  logs 

•  Text  statistic  measures  such  as  sentence  or  word  lengths,  in  particular  used  as  heuristics 
to  quickly  discount  and  eliminate  potential  authors  from  the  search  space 

•  Determining  an  author’s  age,  location,  education  level  or  other  metadata  from  the  text  of 
the  blog  itself  if  it  is  not  provided  by  the  author. 

•  Methods to quantify style such as parsing sentences and determining an author’s preferential
sentence structure or generative grammar rules

•  Automatically  discovering  interconnectedness  of  suspect  authors  to  the  same  entities  as  a 
document  in  question 

•  Discovering  language  use  patterns  which  are  likely  to  result  from  an  author  intentionally 
obfuscating  their  identity  such  as  using  words  with  similar  meanings  and  connotations  in 
two  documents  without  repeating  the  actual  grapheme  itself. 

Additionally, determining whether training and testing data must come from the same
source could have significant implications for practical authorship discovery. For example, we
suspect that an email could be classified accurately using training data from blogs, or that an
addition to a wiki could be attributed by examining potential authors’ blogs. Testing this
hypothesis would require a very specific corpus, one containing text by the same known authors
in more than one medium. Validation of the technique, though, could be of great use, particularly
in criminal investigations, where use of such a classification as evidence would require that the
technique be accepted by the scientific community and that such a study be published in the
scientific literature.

5.3.3  Study  of  stylochronometry  in  blogs 

Further areas of interest which arise from authorship attribution studies in blogs include the
possibility of studying stylochronometry, quantifying the progression of an author’s writing
style through the lifespan of his or her blog. This could be used to determine whether an author
is likely to develop as a writer, perhaps improving, through the act of maintaining a blog and
whether his or her progression is comparable to that of a writer from another genre. It may
be possible, as well, to determine when an unknown sample was written in relation to dated
blog posts. This type of study has been performed on classic literature, but blogs provide a
convenient corpus for testing these techniques because it is easy to draw a test sample with an
absolutely known date and time.
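
One very simple way to place an undated sample relative to dated posts is sketched below. It assumes a single numeric style measure (for example, mean word length) that drifts roughly linearly over the life of the blog, fits a least-squares line to the dated posts, and inverts it. The linear-drift assumption and the function names are illustrative only; whether any style measure behaves this way in blogs is exactly what such a study would need to establish.

```python
from datetime import date, timedelta


def fit_line(xs, ys):
    """Ordinary least-squares fit; returns (slope, intercept)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def estimate_date(dated_posts, unknown_value):
    """Estimate when a sample with the given style value was written.

    dated_posts is a list of (date, style_value) pairs with at least two
    distinct dates. Raises ValueError if the style measure shows no trend.
    """
    origin = min(d for d, _ in dated_posts)
    xs = [(d - origin).days for d, _ in dated_posts]
    ys = [value for _, value in dated_posts]
    slope, intercept = fit_line(xs, ys)
    if slope == 0:
        raise ValueError("style measure shows no trend over time")
    days = round((unknown_value - intercept) / slope)
    return origin + timedelta(days=days)
```

Because every blog post carries a known timestamp, an estimate of this kind can be checked directly by holding out posts and comparing the estimated date with the true one.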

5.3.4  Discovering  new  blogs  of  interest 

Finally, study of the language of blogs could be of great public interest. Authorship discovery
techniques applied on a large scale may also be useful for discovering blogs of interest. For
example, blogs which are similarly ranked by some metric could potentially share topical coverage,
stylistic preferences, or both. A reader who “likes” one blog could use these techniques to
find others worth reading, in a much more robust way than keyword searching or other currently
available techniques.
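
As a sketch of how such a recommendation might work, the example below ranks blogs by the cosine similarity of their feature vectors to those of a blog the reader already likes. Cosine similarity is an assumption chosen for illustration, as are the function names and the representation of each blog as a dictionary of feature counts; any of the comparison measures discussed earlier could be substituted.

```python
from math import sqrt


def cosine(u, v):
    """Cosine similarity between two sparse feature-count dictionaries."""
    dot = sum(count * v.get(feature, 0) for feature, count in u.items())
    norm_u = sqrt(sum(x * x for x in u.values()))
    norm_v = sqrt(sum(x * x for x in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def similar_blogs(liked, profiles, top_k=3):
    """Return up to top_k blog names most similar to the liked blog.

    profiles maps a blog name to its feature-count dictionary.
    """
    scores = [
        (cosine(profiles[liked], vector), name)
        for name, vector in profiles.items()
        if name != liked
    ]
    scores.sort(reverse=True)
    return [name for _, name in scores[:top_k]]
```

Unlike keyword search, a ranking of this kind depends on whichever features are placed in the vectors, so it could reflect stylistic similarity as well as shared topics.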